Public Health England (PHE) produces numerous reports, bulletins, and communications, and receives large amounts of feedback. In recent years the ability to analyse text as data has developed rapidly, and there are now tools which can help us gain insight from documents and bodies of text, and which allow us to analyse large numbers of documents quickly. This note applies some of these techniques to Duncan Selbie’s Friday Messages.
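There are a number of steps: loading the libraries, identifying the bulletin URLs, removing stop words, simple frequency analysis and word clouds, building a Document Term Matrix, and topic modelling.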
This analysis is conducted using the statistical package R, which is rapidly becoming one of the main tools for this kind of analysis.
knitr::opts_chunk$set(echo = FALSE)
First we need to load the libraries for the analysis.
library(knitr)
suppressPackageStartupMessages(library(rvest))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(httr))
suppressPackageStartupMessages(library(tm))
suppressPackageStartupMessages(library(pdftools))
suppressPackageStartupMessages(library(tidytext))
suppressPackageStartupMessages(library(ggplot2))
library(tidyverse)
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Conflicts with tidy packages -------------------------------------------------------------
annotate(): ggplot2, NLP
content(): httr, NLP
filter(): dplyr, stats
lag(): dplyr, stats
source("~/themejf.R") ## A standard theme for plots
Then we need to get the data. This process identifies the URLs of the Friday Message posts on the Public Health Matters blog.
## Scraping bulletins
### This is the main stem of the URLs for Public Health Matters Blogs
url <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/"
page <- read_html(url)
urls <- page %>%
  html_nodes("a") %>%   # find all links
  html_attr("href")     # get the url
urls <- unique(urls[grepl("friday", urls)]) ## Select those which are friday messages
## Drop links to comment sections and to the category page itself
url_comment <- urls[stringr::str_detect(urls, "comments$")]
url_category <- urls[stringr::str_detect(urls, "category")]
urls <- urls[!urls %in% url_comment]
urls <- urls[!urls %in% url_category]
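To make the effect of these filters concrete, here is a minimal offline sketch applied to a small hand-made vector. Only the first URL comes from the blog; the other three are hypothetical examples of the kinds of link the filters are meant to drop.

```r
library(stringr)

# Hypothetical links for illustration; only the first is a real bulletin URL
example_urls <- c(
  "https://publichealthmatters.blog.gov.uk/2017/07/14/duncan-selbies-friday-message-14-july-2017/",
  "https://publichealthmatters.blog.gov.uk/2017/07/14/duncan-selbies-friday-message-14-july-2017/#comments",
  "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/",
  "https://publichealthmatters.blog.gov.uk/about/"
)

# Keep Friday Messages, then drop comment anchors and category pages
is_friday   <- str_detect(example_urls, "friday")
is_comment  <- str_detect(example_urls, "comments$")
is_category <- str_detect(example_urls, "category")

kept_urls <- unique(example_urls[is_friday & !is_comment & !is_category])
print(kept_urls)
```

Only the bulletin URL itself should survive the three filters.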
Stop words are common English words which occur frequently in all documents and are generally removed for analytical purposes. In addition, there are words common to all bulletins which add little value in analysis - we’ll add these to the stop_words lexicon.
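As a rough sketch of how the lexicon might be extended, the example below adds a few custom words to the `stop_words` data set from tidytext and removes them from a toy sentence. The custom words and the example sentence are assumptions for illustration, not the list used in this note.

```r
library(dplyr)
library(tidytext)

# Hypothetical bulletin-specific stop words (not the note's actual list)
custom_stop_words <- tibble(
  word    = c("phe", "week", "published"),
  lexicon = "custom"
)
all_stop_words <- bind_rows(stop_words, custom_stop_words)

# Toy text standing in for a bulletin
toy <- tibble(doc = 1, text = "This week PHE published new data on sugar reduction")

# One row per word (lower-cased by default), then drop the stop words
toy_tokens <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(all_stop_words, by = "word")

print(toy_tokens)
```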
We can do some simple analysis.
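A simple starting point is to count how often each word occurs once stop words are removed. The sketch below does this on two made-up sentences; the data are assumptions, and only the approach (tokenise, remove stop words, count) is intended to carry over.

```r
library(dplyr)
library(tidytext)

# Toy bulletins for illustration only
toy <- tibble(
  doc  = c("b1", "b2"),
  text = c("Sugar reduction and sugar in food",
           "Obesity, sugar and housing")
)

# Tokenise, remove standard stop words, and count words across all bulletins
word_counts <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

print(word_counts)
```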
And wordclouds…
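One way to draw a word cloud from word counts is the `wordcloud` package. This is an assumption, since that package is not among the libraries loaded above and the note does not show which tool was used. The counts below are made up.

```r
library(wordcloud)

# Hypothetical word counts; in practice these would come from the bulletins
cloud_counts <- data.frame(
  word = c("sugar", "obesity", "housing", "drugs", "transport"),
  n    = c(12, 9, 5, 4, 2),
  stringsAsFactors = FALSE
)

# Word size is proportional to frequency; most frequent words are placed first
wordcloud(
  words        = cloud_counts$word,
  freq         = cloud_counts$n,
  min.freq     = 1,
  max.words    = 50,
  random.order = FALSE
)
```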
We can extend the analysis further by looking at the distribution of terms in individual bulletins, and then looking for patterns to see if bulletins can be clustered according to content.
First we can look at a single bulletin:
| | text | title | date1 |
|---|---|---|---|
| 2 | We are a healthy nation and we are living longer and in better health. We are incredibly fortunate to have dynamic local Government leaders, with relentlessly hardworking teams and an unrivalled health system in the NHS. We have achieved so much to be proud of, yet we are still faced with the fact that benefiting from more years in good health is not something shared equally across society.Yesterday, we published our first Health Profile for England, bringing together the wealth of population data that we and our partners collect to give a broad picture of the health of people in England today. We’ve also published an easy-to-read blog, outlining the 10 key messages. A big part of our role at PHE is to provide evidence and interpret data, and while this report captures and showcases what we know in a novel way, it also makes plain that health inequalities remain a major theme. As the new report shows, people in the richest areas of the country are enjoying nearly 20 more years in good health than those living in the poorest.The Health Profile for England reinforces that good public health is influenced by much more than healthcare alone. Health and wellbeing for individuals is greatly increased by having a job, having a roof over your head, being part of a community and receiving and giving support to people you care for and about. We want this to be used as a reference point when policymakers are thinking about the broader impacts on health of public policy across government both local and national, the NHS, employers and the voluntary and third sector. Going forward, we want to work with policy makers, decision makers and practitioners to reach more people, particularly those who are vulnerable, with the interventions that are going to enable them to live healthier and longer lives. 
I do hope you find this inspiring and concerning in equal order and that you help PHE improve the product for future years.Published today is the Government’s new Drugs Strategy, strongly informed by our comprehensive Drugs Evidence Review, published in January this year. It again emphasises the importance for vulnerable people of having a decent job and housing, along with treatment, as being key to a sustained recovery. In addition, it recognises the need for all parts of the health and social care system to work together to improve drug users’ physical and mental health - often badly damaged by long term use.Drug misuse is a complex issue that does not happen in isolation. The strategy's focus on the close partnerships that are needed to create positive change is timely and it gives a clear leadership role to local authorities on the drug prevention and treatment agenda, and PHE will provide help and support in implementing it.This week we published a resource that looks at ways to assist local government both in reducing children and young people’s risk of child sexual exploitation (CSE) and intervening when it does happen. With the support of the Association of Directors of Public Health and the Children’s Commissioner for England we have set out the evidence and produced a framework through which three key local actions can be taken: lead, understand and act. Please do have a look at this.Another factor affecting your health is the natural and built environment. Working with the University of the West of England, PHE has produced a series of infographics summarising the quality and strength of the evidence concentrating on five key built environment topics, including: neighbourhood design, housing, access to healthier food, natural and sustainable environment, and transport. 
Spatial Planning for Health: An evidence resource, is a practical summary for use by local planners, public health teams and local communities to help them develop Local Plans and deliver building projects on the ground which demonstrate the links between good design and health.And finally, on Wednesday evening at Colindale, our scientific campus in North London, a ceremony took place for the first graduates of our employment programme for people with a learning disability and/or who are on the autism spectrum, called Project SEARCH. The average employment rate for individuals aged 18–24 with a learning disability in the UK is just 7%, but for those involved in Project SEARCH, this rises to 65%. I met the students, their families and their PHE mentors during a moving evening, which saw our students leave us with 800 hours under their belt of meaningful work experience, more employability and a BTEC Entry Level Award and Certificate in Work Skills and I could not be more proud of them and our staff. The students will continue to get support from Project SEARCH to move into mainstream employment and we look forward to hearing about their success and welcoming the next class in September. | https://publichealthmatters.blog.gov.uk/2017/07/14/duncan-selbies-friday-message-14-july-2017/ | 2017-07-14 |
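The `date1` value matches the date embedded in the post URL. The note does not show how it was derived, so the following is only one plausible way to obtain it, using `stringr` and base R.

```r
library(stringr)

post_url <- "https://publichealthmatters.blog.gov.uk/2017/07/14/duncan-selbies-friday-message-14-july-2017/"

# Pull out the first yyyy/mm/dd pattern in the URL and convert it to a Date
date_part <- str_extract(post_url, "\\d{4}/\\d{2}/\\d{2}")
date1 <- as.Date(date_part, format = "%Y/%m/%d")

print(date1)
```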
We can then create a table with one row per document and one column per term, known as a Document Term Matrix (DTM). First we count the terms per document.
Next we can create the DTM.
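The two steps can be sketched as follows on toy data, using `cast_dtm()` from tidytext (which relies on the tm package). The object names and texts here are illustrative assumptions; the note's own object is `corp_dtm`.

```r
library(dplyr)
library(tidytext)
library(tm)

# Toy bulletins for illustration only
toy <- tibble(
  doc  = c("b1", "b2"),
  text = c("Sugar reduction and sugar in food",
           "Drugs strategy, housing and transport")
)

# Step 1: count terms per document
term_counts <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(doc, word)

# Step 2: cast the counts into a DocumentTermMatrix
toy_dtm <- term_counts %>%
  cast_dtm(document = doc, term = word, value = n)

print(toy_dtm)
```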
corp_dtm
<<DocumentTermMatrix (documents: 10, terms: 1534)>>
Non-/sparse entries: 2413/12927
Sparsity : 84%
Maximal term length: 19
Weighting : term frequency (tf)
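We can see which words tend to appear together in the bulletins. findAssocs() lists the terms whose counts across bulletins correlate with those of a chosen term (here "sugar") at or above a lower limit (here 0.6). With only 10 bulletins in the matrix, many terms reach 1.00, so these scores should be read with caution.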
findAssocs(corp_dtm, "sugar", 0.6)
$sugar
businesses encouraging reduction 2020 identified
1.00 1.00 1.00 1.00 1.00
industry remove trusts 14 200,000
1.00 1.00 1.00 1.00 1.00
2019 accept acute add approaches
1.00 1.00 1.00 1.00 1.00
atrial baseline behaviour bullied buy
1.00 1.00 1.00 1.00 1.00
charities charity child.our childhood choice
1.00 1.00 1.00 1.00 1.00
collaborative colleagues.next commitments companies comprises
1.00 1.00 1.00 1.00 1.00
consultations consumers consumption councils deadlines
1.00 1.00 1.00 1.00 1.00
default delivery desire diet efficiency
1.00 1.00 1.00 1.00 1.00
emphasising estates estimate fibrillation.it flexibility
1.00 1.00 1.00 1.00 1.00
foods gaps genetics guidelines instilled
1.00 1.00 1.00 1.00 1.00
joint lower maintains march.obesity meals
1.00 1.00 1.00 1.00 1.00
milestone mind month’s obese ordinated
1.00 1.00 1.00 1.00 1.00
originally outlets parent players portfolio
1.00 1.00 1.00 1.00 1.00
portions providers pub quit reformulation
1.00 1.00 1.00 1.00 1.00
restaurants richmond rise staff.doing subject
1.00 1.00 1.00 1.00 1.00
takeaways template tonnes undeniable uniquely
1.00 1.00 1.00 1.00 1.00
untapped view visitors website food
1.00 1.00 1.00 1.00 0.99
obesity smoke stay gap home
0.95 0.95 0.88 0.87 0.85
voluntary partners free healthier leading
0.85 0.80 0.78 0.75 0.75
environment 2017 potential contribute products
0.67 0.67 0.67 0.67 0.67
quickly scale significant 100 2016
0.67 0.67 0.67 0.67 0.67
blogs british cardiovascular charitable choices
0.67 0.67 0.67 0.67 0.67
commitment complex confidence develop eating
0.67 0.67 0.67 0.67 0.67
emphasises expanding focuses government’s initiative
0.67 0.67 0.67 0.67 0.67
intervention leave outlines overweight playing
0.67 0.67 0.67 0.67 0.67
progress publication reality sectors setting
0.67 0.67 0.67 0.67 0.67
slide steps studies taking unrivalled
0.67 0.67 0.67 0.67 0.67
unwell video care
0.67 0.67 0.61
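To show what these scores represent, the sketch below builds a tiny DTM from four made-up texts and compares `findAssocs()` with a correlation computed directly from the term-count columns. The texts are assumptions for illustration.

```r
library(tm)

# Toy documents for illustration only
texts <- c("sugar sugar obesity food",
           "sugar obesity",
           "housing drugs",
           "housing transport")

toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Terms whose counts correlate with "sugar" at 0.6 or above
assoc <- findAssocs(toy_dtm, "sugar", 0.6)
print(assoc)

# The same figure computed by hand from the dense matrix
m <- as.matrix(toy_dtm)
manual_cor <- cor(m[, "sugar"], m[, "obesity"])
print(round(manual_cor, 2))
```

Terms that never co-occur with "sugar" (such as "housing" here) are negatively correlated and so do not appear in the result.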
The next step is topic modelling. This allows us to analyse the whole body of bulletins and look for themes or topics: groupings of words within and between documents.
corp_lda
A LDA_VEM topic model with 6 topics.
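The fitting step is not shown above. A minimal sketch of how such a model might be fitted with the `topicmodels` package follows, using toy texts, two topics rather than six, and an arbitrary seed (all assumptions for illustration).

```r
library(tm)
library(topicmodels)
library(tidytext)

# Toy documents for illustration only
texts <- c("sugar obesity food sugar reduction",
           "sugar food obesity children",
           "drugs housing treatment recovery",
           "housing drugs strategy treatment")

toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# LDA() uses the VEM method by default; the seed makes the fit reproducible
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# One row per topic-term pair; beta is the per-topic term probability
toy_topics <- tidy(toy_lda, matrix = "beta")
print(toy_topics)
```

Within each topic the `beta` values sum to 1, so the highest values identify the terms most characteristic of that topic.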
corp_lda_tidy <- tidy(corp_lda)   # one row per topic-term pair, with probability beta
corp_lda_tidy %>%
  group_by(topic) %>%
  top_n(20, beta) %>%             # 20 most probable terms in each topic
  ungroup() %>%
  arrange(topic, -beta) %>%
  ggplot(aes(term, beta, fill = factor(topic), label = term)) +
  geom_col() +
  geom_text(hjust = 0, size = 10) +
  coord_flip() +
  facet_wrap(~topic, ncol = 4) +
  theme_jf() +
  labs(fill = "Topic")
corp_lda_tidy %>%
  group_by(topic) %>%
  top_n(20, beta) %>%             # 20 most probable terms in each topic
  ungroup() %>%
  arrange(topic, -beta) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_point(aes(colour = factor(topic))) +
  geom_line(aes(group = topic)) +
  coord_polar() +
  facet_wrap(~topic, ncol = 4) +
  theme(legend.position = "none") +   # hide the legend
  theme(axis.text.x = element_text(size = 12))